Systematic Characterizations of Text Similarity in Full Text Biomedical Publications

Authors

  • Zhaohui Sun
  • Mounir Errami
  • Tara Long
  • Chris Renard
  • Nishant Choradia
  • Harold Garner
Abstract

BACKGROUND Computational methods have been used to find duplicate biomedical publications in MEDLINE. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. Here, we quantitatively investigated the full text similarity of biomedical publications in PubMed Central.
METHODOLOGY/PRINCIPAL FINDINGS 72,011 full text articles from PubMed Central (PMC) were parsed to generate three different datasets: full texts, sections, and paragraphs. Text similarity comparisons were performed on these datasets using the text similarity algorithm eTBLAST. We measured the frequency of similar text pairs and compared it among different datasets. We found that high abstract similarity can be used to predict high full text similarity with a specificity of 20.1% (95% CI [17.3%, 23.1%]) and sensitivity of 99.999%. Abstract similarity and full text similarity have a moderate correlation (Pearson correlation coefficient: -0.423) when the similarity ratio is above 0.4. Among pairs of articles in PMC, method sections are found to be the most repetitive (frequency of similar pairs, methods: 0.029, introduction: 0.0076, results: 0.0043). In contrast, among a set of manually verified duplicate articles, results are the most repetitive sections (frequency of similar pairs, results: 0.94, methods: 0.89, introduction: 0.82). Repetition of introduction and methods sections is more likely to be committed by the same authors (odds of a highly similar pair having at least one shared author, introduction: 2.31, methods: 1.83, results: 1.03). There is also significantly more similarity in pairs of review articles than in pairs containing one review and one nonreview paper (frequency of similar pairs: 0.0167 and 0.0023, respectively).
CONCLUSION/SIGNIFICANCE While quantifying abstract similarity is an effective approach for finding duplicate citations, a comprehensive full text analysis is necessary to uncover all potential duplicate citations in the scientific literature and is helpful when establishing ethical guidelines for scientific publications.
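
The sensitivity and specificity figures quoted in the abstract describe how well a cutoff on abstract similarity predicts high full text similarity. As a rough illustration only, the sketch below computes both quantities from a list of (abstract similarity, full text similarity) scores. The function name, the cutoff values, and the toy scores are all hypothetical; they are not taken from the paper and do not reproduce eTBLAST or the authors' procedure.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(
    pairs: Iterable[Tuple[float, float]],
    abstract_cutoff: float,
    full_text_cutoff: float,
) -> Tuple[float, float]:
    """Treat 'abstract similarity >= cutoff' as a predictor of
    'full text similarity >= cutoff' and score it.

    Both cutoffs are illustrative assumptions, not values from the paper.
    """
    tp = fp = tn = fn = 0
    for abstract_sim, full_text_sim in pairs:
        predicted = abstract_sim >= abstract_cutoff
        actual = full_text_sim >= full_text_cutoff
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    # Sensitivity: share of truly similar full texts that the abstract flags.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of dissimilar full texts that the abstract clears.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up scores for six article pairs: (abstract similarity, full text similarity).
toy_pairs = [(0.9, 0.8), (0.7, 0.2), (0.1, 0.1), (0.2, 0.05), (0.6, 0.7), (0.3, 0.6)]
sens, spec = sensitivity_specificity(toy_pairs, abstract_cutoff=0.5, full_text_cutoff=0.5)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

With real data, the scores would come from a similarity tool run on abstracts and on full texts separately, and the cutoffs would need to be chosen and justified for that tool.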

Similar resources

Dysphagia Improvement Using Acupuncture Therapy: A Systematic Review

Background. Dysphagia is a common complication in patients with stroke. The research on acupuncture treatment of dysphagia has increased, but the results are not consistent. In this review we intend to answer “what is the potential of acupuncture in treating dysphagia in stroke patients and which acupuncture points are the most promising for treating dysphagia?” Methods. This systematic review...

Future competencies for hospital management in developing countries: Systematic review

Background: This was a systematic review presenting the future competencies for hospital managers. Methods: Participants, interventions, comparisons and outcomes (PICO) strategy with MeSH terms were used for searching. Databases used were Web of Science, PsycINFO and Medline, EBSCO, ScienceDirect, Emerald, ProQuest, Social Sciences Research Network, Embase, and some Iranian database su...

Semantics-based Text Mining of Biomedical Concepts in

Searching publications for prior work on scientific concepts is central to the research process. The relevant parts of retrieved publications are typically found and evaluated manually. In the field of biomedicine, due to rapidly growing numbers of publications and the lack of standard scientific terminologies, this task is particularly challenging, complex and time consuming. Prior information...

Distribution of information in biomedical abstracts and full-text publications

MOTIVATION Full-text documents potentially hold more information than their abstracts, but require more resources for processing. We investigated the added value of full text over abstracts in terms of information content and occurrences of gene symbol--gene name combinations that can resolve gene-symbol ambiguity. RESULTS We analyzed a set of 3902 biomedical full-text articles. Different key...

Figure Text Extraction in Biomedical Literature

BACKGROUND Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures ef...

Journal title:

Volume 5, Issue

Pages -

Publication date: 2010